class: center, middle, inverse, title-slide

# APSTA-GE 2003: Intermediate Quantitative Methods
## Lab Section 003, Week 8
### New York University
### 10/27/2020

---

## Reminders

- **Assignment 4**
  - Due: **10/28/2020 11:55pm (EST)**
  - Updated Shiny App: [https://tongj.shinyapps.io/2003iqm](https://tongj.shinyapps.io/2003iqm)
- **Assignment 5**
  - Due: **11/06/2020 11:55pm (EST)**
- Office hours
  - Monday 9 - 10am (EST)
  - Wednesday 12:30 - 1:30pm (EST)
  - Additional time slots available
    - Sign-up sheet [HERE](https://docs.google.com/spreadsheets/d/1YY38yj8uCNIm1E7jaI9TJC494Pye2-Blq9eSK_eh6tI/edit?usp=sharing)
- Office hour Zoom link
  - https://nyu.zoom.us/j/97347070628 (pin: 2003)
- Office hour notes
  - Available on NYU Classes

---

## Today's Topics

- Review on Multiple Regression
- Multiple Regression Analysis

---

class: inverse, center, middle

# Review on Multiple Regression

---

## Multiple Regression

- Linear: straight line

`$$Y_i = \beta_0 + \beta_1 \cdot X_i + \varepsilon_i$$`

- Multiple linear regression:
  - A linear regression with two or more independent variables
- Types of multiple linear regression:
  - Additive
  - Interaction

---

## Additive

**Additive model equation:**

`$$Y_i = \beta_0 + \beta_1 \cdot X_{i1} + \beta_2 \cdot X_{i2} + ... + \beta_k \cdot X_{ik} + \varepsilon_i, \ \ \ i = 1, ..., n$$`

It can also be written as:

`$$Y_i = \beta_0 + \mathcal{X}_i + \varepsilon_i$$`

`$$\mathcal{X}_i = \beta_1 \cdot X_{i1} + \beta_2 \cdot X_{i2} + ... + \beta_k \cdot X_{ik}$$`

This reveals an additive relation:

- slopes are independent: the effect of each predictor does not depend on the values of the others
- regression lines are parallel: changing one predictor shifts the line for another predictor up or down without changing its slope

---

## Demo 1: Additive

```r
dat1 <- read.csv("lung_capacity0.csv")
dplyr::glimpse(dat1)
```

```
## Observations: 80
## Variables: 6
## $ Sex            <int> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, …
## $ Height         <dbl> 66.9, 67.1, 70.7, 70.9, 73.3, 76.1, 70.9, 71.9, 71.1, …
## $ Smoker         <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, …
## $ Exercise       <int> 22, 23, 21, 18, 31, 39, 26, 0, 18, 22, 19, 22, 15, 1, …
## $ Age            <int> 49, 41, 82, 26, 69, 42, 68, 40, 35, 50, 40, 58, 19, 33…
## $ LungCapacitycc <int> 5093, 5116, 5550, 5530, 5929, 6212, 5723, 5133, 5415, …
```

```r
summary(mod1 <- lm(LungCapacitycc ~ Age + Smoker, data = dat1))$coefficients
```

```
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 5455.194065 131.083801 41.6160808 1.534390e-54
## Age            2.004431   2.532095  0.7916099 4.310200e-01
## Smoker      -596.342602  68.393054 -8.7193446 4.169716e-13
```

`$$\text{Lung Capacity} = 5455.194 + 2.004 \cdot \text{Age} - 596.343 \cdot \text{Smoker} + \varepsilon$$`

---

## Demo 1: Additive (Continued)

```r
summary(mod1)$coefficients["Smoker", ]
```

```
##      Estimate    Std. Error       t value      Pr(>|t|)
## -5.963426e+02  6.839305e+01 -8.719345e+00  4.169716e-13
```
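
The "parallel lines" property can be checked directly from the Demo 1 estimates. The following is a minimal sketch that assumes only the coefficients printed for `mod1`; the helper `predict_lung()` and the age grid are hypothetical names introduced for illustration and are not part of the original lab code.

```r
# Coefficients copied from the mod1 output (Demo 1)
b0       <- 5455.194065
b_age    <- 2.004431
b_smoker <- -596.342602

# Hypothetical helper: predicted lung capacity (cc) under the additive model
predict_lung <- function(age, smoker) {
  b0 + b_age * age + b_smoker * smoker
}

ages <- c(20, 40, 60, 80)  # illustrative ages, not taken from the data

non_smokers <- predict_lung(ages, smoker = 0)
smokers     <- predict_lung(ages, smoker = 1)

# The gap between the two lines equals the Smoker coefficient at every age,
# which is what "parallel regression lines" means for an additive model
gap <- smokers - non_smokers
print(data.frame(age = ages, non_smokers, smokers, gap))
```

Because `Smoker` enters the model only through its own term, the `gap` column is the same at every age.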
---

## Interaction

**Interaction model equation:**

With two independent variables:

`$$Y_i = \beta_0 + \beta_1 \cdot X_{i1} + \beta_2 \cdot X_{i2} + \beta_3 \cdot X_{i1} X_{i2} + \varepsilon_i, \ \ \ i = 1, ..., n$$`

With three independent variables:

`$$Y_i = \beta_0 + \beta_1 \cdot X_{i1} + \beta_2 \cdot X_{i2} + \beta_3 \cdot X_{i3} + \\ \beta_4 \cdot X_{i1} X_{i2} + \beta_5 \cdot X_{i2} X_{i3} + \beta_6 \cdot X_{i1} X_{i3} + \varepsilon_i, \ \ \ i = 1, ..., n$$`

This reveals an interactive relation:

- slopes are related: the effect of one predictor depends on the value of another
- regression lines are non-parallel: the slope for one predictor changes across values of the other

---

## Demo 2: Interaction

[Stress Data](https://rdrr.io/cran/datarium/man/stress.html)

```r
*dat2 <- datarium::stress
dplyr::glimpse(dat2)
```

```
## Observations: 60
## Variables: 5
## $ id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ score     <dbl> 95.6, 82.2, 97.2, 96.4, 81.4, 83.6, 89.4, 83.8, 83.3, 85.7,…
## $ treatment <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ exercise  <fct> low, low, low, low, low, low, low, low, low, low, moderate,…
## $ age       <dbl> 59, 65, 70, 66, 61, 65, 57, 61, 58, 55, 62, 61, 60, 59, 55,…
```

```r
summary(mod1 <- lm(score ~ treatment + exercise + age, data = dat2))$coefficients
```

```
##                     Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)      55.72934446 10.9188850  5.10394099 4.272045e-06
## treatmentno       4.32528628  1.3774437  3.14008203 2.716774e-03
## exercisemoderate  0.08735408  1.6903218  0.05167896 9.589717e-01
## exercisehigh     -9.61841026  1.8474096 -5.20643089 2.955621e-06
## age               0.49811005  0.1764842  2.82240615 6.621719e-03
```

---

## Demo 2: Interaction (Continued)
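
Regrouping the two-predictor interaction equation around `$X_{i1}$` shows why an interaction makes the regression lines non-parallel. This is a short worked rearrangement of that equation, not an additional model:

`$$Y_i = (\beta_0 + \beta_2 \cdot X_{i2}) + (\beta_1 + \beta_3 \cdot X_{i2}) \cdot X_{i1} + \varepsilon_i$$`

The slope on `$X_{i1}$` is `$\beta_1 + \beta_3 \cdot X_{i2}$`, so it changes with `$X_{i2}$` whenever `$\beta_3 \neq 0$`. Setting `$\beta_3 = 0$` recovers the additive model, in which the slope is the same at every value of `$X_{i2}$`.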
---

class: inverse, center, middle

# Multiple Regression Analysis

---

## The data

[Marketing](https://rdrr.io/cran/datarium/man/marketing.html)

A data frame containing the impact of three advertising media (youtube, facebook and newspaper) on sales. The data record the advertising budget, in thousands of dollars, along with the sales. The advertising experiment has been repeated 200 times.

```r
*dat3 <- datarium::marketing
dplyr::glimpse(dat3)
```

```
## Observations: 200
## Variables: 4
## $ youtube   <dbl> 276.12, 53.40, 20.64, 181.80, 216.96, 10.44, 69.00, 144.24,…
## $ facebook  <dbl> 45.36, 47.16, 55.08, 49.56, 12.96, 58.68, 39.36, 23.52, 2.5…
## $ newspaper <dbl> 83.04, 54.12, 83.16, 70.20, 70.08, 90.00, 28.20, 13.92, 1.2…
## $ sales     <dbl> 26.52, 12.48, 11.16, 22.20, 15.48, 8.64, 14.16, 15.84, 5.76…
```

---

## Fit an additive model

Fit an additive regression model using `sales` as the dependent variable and `youtube` as the predictor.

--

```
##
## Call:
## lm(formula = sales ~ youtube, data = dat3)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.0632  -2.3454  -0.2295   2.4805   8.6548
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.439112   0.549412   15.36   <2e-16 ***
## youtube     0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
```

---

## Summary: additive model

```r
summary(mod_add0)$coefficients
```

```
##               Estimate  Std. Error  t value    Pr(>|t|)
## (Intercept) 8.43911226 0.549411528 15.36028 1.40630e-35
## youtube     0.04753664 0.002690607 17.66763 1.46739e-42
```

--

`$$\text{sales} = 8.439 + 0.048 \cdot \text{youtube} + \varepsilon$$`

---

## Fit another additive model

Fit another additive regression model using `sales` as the dependent variable and `youtube` and `facebook` as predictors.
--

```r
mod_add1 <- lm(sales ~ youtube + facebook, data = dat3)
summary(mod_add1)
```

```
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = dat3)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.5572  -1.0502   0.2906   1.4049   3.3994
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.50532    0.35339   9.919   <2e-16 ***
## youtube      0.04575    0.00139  32.909   <2e-16 ***
## facebook     0.18799    0.00804  23.382   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.018 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16
```

---

## Summary: additive model

```r
summary(mod_add1)$coefficients
```

```
##               Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 3.50531989 0.353387614  9.919193 4.565557e-19
## youtube     0.04575482 0.001390356 32.908708 5.436980e-82
## facebook    0.18799423 0.008039973 23.382446 9.776972e-59
```

--

`$$\text{sales} = 3.505 + 0.046 \cdot \text{youtube} + 0.188 \cdot \text{facebook} + \varepsilon$$`

---

## Visualize: additive model
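
One way the additive fit could be drawn is sketched below. It assumes only the `mod_add1` coefficients reported in the summary and uses base graphics, which may differ from the plotting code behind the original slide. The helper `predict_add()` and the three facebook budgets are hypothetical choices made for illustration.

```r
# Coefficients copied from the mod_add1 summary
b0         <- 3.50531989
b_youtube  <- 0.04575482
b_facebook <- 0.18799423

# Hypothetical helper: predicted sales under the additive model
predict_add <- function(youtube, facebook) {
  b0 + b_youtube * youtube + b_facebook * facebook
}

youtube_grid    <- seq(0, 300, by = 50)  # youtube budget (thousands of dollars)
facebook_levels <- c(10, 30, 50)         # illustrative facebook budgets

# One column of predictions per facebook budget
pred_add <- sapply(facebook_levels, function(fb) predict_add(youtube_grid, fb))
colnames(pred_add) <- paste0("facebook = ", facebook_levels)

# Each column is a straight line in youtube with the same slope,
# so the three lines are parallel and differ only in height
matplot(youtube_grid, pred_add, type = "l", lty = 1, col = 1:3,
        xlab = "youtube", ylab = "predicted sales")
legend("topleft", legend = colnames(pred_add), col = 1:3, lty = 1)
```

Raising the facebook budget moves the line up by `b_facebook` per unit but leaves the youtube slope unchanged.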
---

## Fit one more additive model

Fit another additive regression model using `sales` as the dependent variable and `youtube`, `facebook`, and `newspaper` as predictors.

--

```r
mod_add2 <- lm(sales ~ youtube + facebook + newspaper, data = dat3)
summary(mod_add2)
```

```
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = dat3)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.5932  -1.0690   0.2902   1.4272   3.3951
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.526667   0.374290   9.422   <2e-16 ***
## youtube      0.045765   0.001395  32.809   <2e-16 ***
## facebook     0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.023 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
```

---

## Summary: additive model

```r
summary(mod_add2)$coefficients
```

```
##                 Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)  3.526667243 0.374289884  9.4222884 1.267295e-17
## youtube      0.045764645 0.001394897 32.8086244 1.509960e-81
## facebook     0.188530017 0.008611234 21.8934961 1.505339e-54
## newspaper   -0.001037493 0.005871010 -0.1767146 8.599151e-01
```

--

`$$\text{sales} = 3.527 + 0.046 \cdot \text{youtube} + 0.189 \cdot \text{facebook} - 0.001 \cdot \text{newspaper} + \varepsilon$$`

---

## Fit an interactive model

Fit an interactive regression model using `sales` as the dependent variable and `youtube` and `facebook`, together with their interaction, as predictors.

```
##
## Call:
## lm(formula = sales ~ youtube * facebook, data = dat3)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.6039 -0.4833  0.2197  0.7137  1.8295
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      8.100e+00  2.974e-01  27.233   <2e-16 ***
## youtube          1.910e-02  1.504e-03  12.699   <2e-16 ***
## facebook         2.886e-02  8.905e-03   3.241   0.0014 **
## youtube:facebook 9.054e-04  4.368e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.132 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16
```

---

## Summary: interactive model

```
##                      Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)      8.1002642437 2.974456e-01 27.232755 1.541461e-68
## youtube          0.0191010738 1.504146e-03 12.698953 2.363605e-27
## facebook         0.0288603399 8.905273e-03  3.240815 1.400461e-03
## youtube:facebook 0.0009054122 4.368366e-05 20.726564 2.757681e-51
```

`$$\text{sales} = 8.100 + 0.019 \cdot \text{youtube} + 0.029 \cdot \text{facebook} + 0.001 \cdot \text{youtube} \times \text{facebook} + \varepsilon$$`

---

## Visualize: interactive model

```r
library(ggplot2)  # ggplot(), geom_point(), geom_smooth(), theme_minimal()
library(plotly)   # ggplotly()

# Sales vs. youtube budget, with separate linear fits for
# high (> 25) and low (<= 25) facebook budgets
p <- ggplot(dat3, aes(x = youtube, y = sales, color = facebook)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, data = dat3[dat3$facebook > 25, ]) +
  geom_smooth(method = "lm", se = FALSE, data = dat3[dat3$facebook <= 25, ]) +
  theme_minimal()
ggplotly(p)  # convert to an interactive plotly widget
```
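
The non-parallel lines in the plot can also be read off the interactive model's estimates. The sketch below assumes only the `youtube` and `youtube:facebook` coefficients from the coefficient table; the helper `youtube_slope()` and the facebook budgets are hypothetical names and values chosen for illustration.

```r
# Coefficients copied from the interactive model's coefficient table
b_youtube  <- 0.0191010738   # youtube main effect
b_interact <- 0.0009054122   # youtube:facebook interaction

# Hypothetical helper: slope of sales on youtube at a given facebook budget
youtube_slope <- function(facebook) {
  b_youtube + b_interact * facebook
}

facebook_levels <- c(0, 25, 50)  # illustrative budgets (thousands of dollars)
slopes <- youtube_slope(facebook_levels)

# The youtube slope grows with the facebook budget, so the fitted
# lines fan out instead of staying parallel
print(data.frame(facebook = facebook_levels, youtube_slope = slopes))
```

The `youtube` coefficient on its own is the youtube slope only when the facebook budget is zero, which is why main effects in an interaction model should not be interpreted in isolation.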
---

## Contact

Tong Jin

- Email: tj1061@nyu.edu
- Office Hours
  - Mondays, 9 - 10am (EST)
  - Wednesdays, 12:30 - 1:30pm (EST)